Rolling Stone Magazine assembled a list of 100 greatest artists across a broad span of musical genres in 2011. This report attempts to explore the enduring engagement of these artists and their works at the end of 2023. To develop this, I will retrieve the artists’ Spotify metrics and YouTube video stats as my analytical dataset, and interpret the findings with data visualisations.
The project is to extract the rankings and names of the top 100 artists by scraping Rolling Stone’s webpage. The task involves retrieving data from two separate pages and merging it into a single data frame of 2 columns and 100 rows.
In this report, each artist’s Spotify ID will be fetched to obtain follower counts and popularity scores (both the artists and their top tracks) using API calls, offering insight into their current media presence.
Although the data provided by Spotify can interpret the general level of the artists’ enduring engagement, more detailed quantitative evidence and audience feedback is needed. Therefore, official YouTube channel statistics for 20 representative artists will be gathered. This includes views, comments, likes, favourites, and comment texts. Randomly sampling is expected here due to daily API query limits.
The final dataset consists of two tables:
The Spotify Data: 9 columns and 100 rows
The YouTube Data: 8 columns and 20 rows.
The first few rows of the final structured Spotify data can be found as follows. In this section, I will visualise the data to interpret the artists’ current endurance.
# Review the first few rows of the updated dataframe.
data <- read.csv("artists_spotify_data")
head(data)
## Name Rank id followers popularity
## 1 The Beatles 1 3WrFJ7ztbogyGnTHbHJFl2 26763532 83
## 2 Bob Dylan 2 74ASZWbe4lXaubB36ztrGX 6326699 70
## 3 Elvis Presley 3 43ZHCT0cAZBISjO8DG9PnE 9020328 81
## 4 The Rolling Stones 4 22bE4uQ6baNwSHPVcDxLCe 13607260 78
## 5 Chuck Berry 5 293zczrfYafIItmnmM3coR 1917547 72
## 6 Jimi Hendrix 6 776Uo845nYHJpNaStv1Ds4 6493958 68
## top_track_name top_track_id track_popularity
## 1 Here Comes The Sun - Remastered 2009 6dGnYIeXmHdcikdzNNDMm2 85
## 2 Knockin' On Heaven's Door 6HSXNV0b4M4cLJ7ljgVVeh 77
## 3 Blue Christmas 3QiAAp20rPC3dcAtKtMaqQ 90
## 4 Paint It, Black 63T7DJ1AFDD6Bn8VzG6JE8 84
## 5 Run Rudolph Run 2pnPe4pJtq7689i5ydzvJJ 89
## 6 All Along the Watchtower 2aoo2jlRnM3A0NyLQqMN2f 77
## album_id
## 1 0ETFjACtuP2ADo6LFhL6HN
## 2 2Pj2kZM5XpyIeyFBTAVulL
## 3 6zk4RKl6JFlgLCV4Z7DQ7N
## 4 72qrnM4yUNMDDlWiqKc8iY
## 5 1DILNh7maaYyKxe15V9xLq
## 6 5z090LQztiqh13wYspQvKQ
(Note: all charts here are interactive, hover to view each detailed data.)
In the graph P1, the bar chart shows the number of followers of each artist on Spotify. An artist with more followers could be asserted to obtain a larger fan base and a stronger maintenance potential. Notably, the number of followers does not have a linear relationship with the rankings and varies extremely from one another. Artists such as Eminen, Queen, and The Beatles have a follower index that far exceeds any others. Eminen, especially, boasts nearly 80 million followers despite being ranked 83. But it is also evident that most artists’ fan base is under 10 million, with the lowest number of just 10 thousand, which is not exactly competitive in today’s music platforms.
The heat maps indicate the Spotify popularity scores for the artists and their top tracks, with deeper colours representing higher popularity index. P2-2 and P2-3 are rearranged from P2 in descending order of popularity, providing a clearer vision of the distribution of this value across the cohort. We can assert that most of these artists are still celebrated today, with 90% receiving a popularity score of over 50 and 26% over 75. By comparing the colour distributions of P2-2 and P2-3, it is surprising to find that the top tracks of these artists enjoy a higher degree of popularity, with over a half of them surpassing the popularity score threshold of 75.
These findings suggest that even without a massive fanbase, these artists rated as the greatest continue to gain adoration, and their tracks are still racking up enduring heat today, being added to playlists and looped by countless listeners.
To gather more quantitative metrics, I will use YouTube API to get the latest video data posted on the official channels of these artists. Here is an overview of the data frame.
# View the final data.
load("youtube_sample_data.RData")
head(video_data2)
## Name
## 1 The Beatles
## 2 Jimi Hendrix
## 3 Bob Marley
## 4 Sam Cooke
## 5 Otis Redding
## 6 The Ramones
## VideoTitle
## 1 “We were trying to think of how to make certain sounds that weren’t available…” - George #RedandBlue
## 2 Angel
## 3 Half the story has never been told. New ‘Bob Marley: One Love’ trailer out TOMORROW. #bobmarley
## 4 Touch The Hem Of His Garment
## 5 Hard to Handle (Karaoke Version) (Originally Performed By Otis Redding)
## 6 Ramones - She's The One (Official Music Video)
## VideoID ViewCount CommentCount LikeCount FavoriteCount
## 1 i0azIt4O9aA 37594 61 4351 0
## 2 b-8Azzi9vMA 109998 7 3199 0
## 3 WytTz95Z6kc 58104 140 4335 0
## 4 YUTzto-T7P0 81 0 2 0
## 5 S8oMh_OWj8w 217 0 4 0
## 6 g7KrBgnjaLY 1401497 647 26600 0
## Comments
## 1 'Revolver' is a really great album and George's track 'Taxman' (which opens side one of the LP) never fails to make me laugh due to it's witty lyrics. I bet all four of them had a lot of fun recording that one.\nTaxman the Song from George, Revolver Remix the best 4 this Song. \nLg Ellen ✌❤\nThe Beatles espetacular, sensacional. The Beatles para sempre!\nAnd stealing the usage of horns, trumpets, riffs off of Frank Zappa’s first album ‘Freak Out,’ for the making of Sgt. Pepper. \n\nIn the long run it didn’t matter. Frank got ‘em’ back on his 3rd album; ‘we’re only in it for the money.’ To name just one. Oh No!\n\nAnd I love the Beatles…my favorite group as a youngster.\nTetszik\n"Taxman" by George Harrison is a great lovely song,one of his masterpiece.The Beatles were always inventing new sounds and creating .\nIts always dangerous when Tavistock controls you.\nUsed to gig tax man with a covers band about 25 years back great tune 👍\nThe truth is that you are and were geniuses\nOhh I love George’s voice. Ty and happy new year 🎉🎉🥂\nI don't know about you guys but having all these Beatles tapes on U tube sure does make me think that George and John are still with us and that is comforting for us old fans who grew up with them. So thank you!\nIt's in the days of sixties.\nGetting dangerous!! George Harrison is very honest here. The Beatles had serious death threats in the USA in 1966..From what Ive heard these death threats were not made by looney cranks. They came from the top of officialdom and were taken so serious by those in the know, that the Beatles played their last US tour in summer 1966..After that they never toured again as Beatles.. John Lennon was shot dead in the USA in 1980,just as he was talking in the American press of touring again.. Then the American "Lone Nut" assassinated him. That "American Lone Nut" has become a joke amongst people who dig deeper.. That American Officialdom has a lot to answer for.. No no no. Officialdom got away with murder and not for the first time.\nTo me the “only studio” was the problem. They were great performing together. There would be no time to grow bad feelings amongst each other.\n🎉For all in Japan🎉\nOur Love Prayers for all of you🎉\nFor all #5 who perished in the firey plane crash God Bless🎉 your Families🎉\nA.Slone of ♥\nSongs 🎶 Great \nSad were missing George John🎉🎉🎉\n'LOVE🎉\nOk back to work meeting with my Fellows🎉🎉🎉\n\nBe kind in kind to each other🎉\nHappy New Year!🎉\nA. Slone of ♥\nEl gran George ❤\nThey used to play Eight days a week in Hamburg\n🔥\n❤
## 2
## 3 Bob Marley experience your video on our channel go check it out!\nIf you were proud, I am proud for you and your family. I was not paying close attention when your father was alive, but I know he is a legend. I am paying close attention now. Congratulations on the success of this movie. I cannot wait to see it.\nOne love\nCan’t wait to see this and Ziggys thoughts really help . \nSo happy he was happy with his father’s portrayal. ❤ONE LOVE\nSkip Marley looks more sound more and more of bob marleys mannerisms\nFebruary 🎉,💯💯💯💯\n❤❤❤❤\n👁️🎶👁️🙏👁️✨🙏👁️🗨️🌞👁️💥👁️💛🇯🇲💚🤎🖤💥💜💙💥🤍🧡💛❣️💥💓🖤✨✨✨👁️❣️👁️🙋♀️\nJust the trailer made me tear up I know I’m gonna cry when I watch this 🥹♥️\nIt's about time. I'm glad, finally a legend movie. The world needs to see this 👌🏾\n❤\nFinally a movie about the king Bob Marley❤️💛💚One Love\nEndorsement of supreme 🎉🎉🎉\nHe should have played him. He looks exactly like Marley..😊❤\nBob marley king of Reggae one love RIP Africa loves you\nThis Film is Long Overdue and I can't wait to see it\nAncioso pra assistir o filme 😊\nNesta's life story February 14 UK cannot wait IAH 🫶🫶🫶\nrastafari the people love robert nesta\nNao me canso de chorar só no trailer imagina no dia do filme
## 4
## 5
## 6 Thought Howard Stern was part of the band lol\nOne of the best!\nWhen Just and Rightous Gods busst the cork on Justice, to spill flood with Life and\nSaludos a Peter Andy y Lutz de la embajada Alemana (Año 1980) Los Punkies de Argentina de LANUS (Buenos Aires) Claudio Gabriel Varicela Mongo Marcelo Manners Jose Javier ❤❤❤❤❤🤩🤩🤩🤩🤩😎🔭\nits incredible how people still lisen to this incredible band many years later. its sad there not as admired how they should be\nThis track leans more to a Rock n Roll style and feel, with a punk beat. Great to hear. A classic never sounds old.\nCheck out the hardcore punk band Bad Brain. Specifically their 2nd album "Rock for Light" which released in 1981. Dave Grohl hailed it as one of his favorite albums of all time.\nHow much rape have the English girls lived to remove Goñs.\n\n\nFxck me Jesus.\n\nRight.\n\nSo (Seriously) Sorry English Girls.\nMagical time to be alive...\nHoward stern\nSimple lyrics chords and Riffs Steady Beats and the \n Same wardrobe !\nSOMEBODY STOLE MY FUCKIN' WORDS. DAMN, FREAK, BITCH AND LOSER. GO TO HELL AND NEVER COME BACK.\nCan't believe all of them are gone 💔 but the name still remains alive\nForce of Nature\nOne of my favorite songs in high school. My best friend had a cassette player and made copies for all of us .. those were the days.\n💛🍀➕️🖤💬🚫1️⃣🎲\nThey should've called it Disco High but kept Ramones\nWTF? The emperor has no clothes.\nExtremely overrated band ever.\nザ ロッカーズのモーターサイクルは、ほぼこれとメロディ一緒やな😂\nKILLER!!!)'(
The bar charts below illustrate the statistics of the latest YouTube video of each sample artist. Comparing the two graphs we can find that the distribution of likes and views is roughly in the same trend, with The Ramones and Madonna ranking the highest. Like previous Spotify follower statistics, only a handful of artists are enjoying high streaming flow for their work now. However, this doesn’t mean they’re being forgotten, as the central word of the final word cloud derived from the comment text of these video shows, they’re still being “loved”.
In conclusion, although the Greatest 100 Artists may not be as widely followed as in the past, they continue to hold a special place in the musical landscape, with their works transcending time and space, underscoring their enduring impact in the world of music. Their music tracks are celebrated by numerous listeners today, as evidenced by statistics on Spotify, and data from YouTube that demonstrate the long-lasting love from their dedicated audience.
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(echo = FALSE)
# Load packages
library(rvest)
library(xml2)
library(dplyr)
library(spotifyr)
library(jsonlite)
library(httr)
library(tidyverse)
## Step 1: scrape the 100 greatest artists data from Rolling Stone webpage.======
# Scrape the first page's 50 artists.
# Navigate to the webpage.
website <- "https://www.rollingstone.com/music/music-lists/100-greatest-artists-147446/"
# Read the page's html.
res <- read_html(website)
# Get the 51-100 artists' data.
rank_51_100 <- res %>%
html_nodes("h2") %>%
html_text() %>%
as.data.frame() %>%
head(50)
colnames(rank_51_100) <- "Name" #Rename the column.
rank_51_100$Rank <- 100:51 #Add ranking numbers.
# Navigate to the next page which contains 1-50 artists' data.
website2 <- "https://www.rollingstone.com/music/music-lists/100-greatest-artists-147446/the-band-2-88489/"
res2 <- read_html(website2)
rank_1_50 <- res2 %>%
html_nodes("h2") %>%
html_text() %>%
as.data.frame() %>%
head(50)
colnames(rank_1_50) <- "Name" #Rename the column.
rank_1_50$Rank <- 50:1 #Add ranking numbers.
# Combine two data frames into one single table.
greatest_100_artists <- bind_rows(rank_51_100, rank_1_50) %>%
arrange(Rank)
## Step 2: retrieve more details for analysis from Spotify API.================
# Set up my Spotify client id and secret by reading from a local ".env" file.
readRenviron("E:/文件/LSE/MY472/spotify id&secret.env")
id <- Sys.getenv("id")
secret <- Sys.getenv("secret")
Sys.setenv(SPOTIFY_CLIENT_ID = id)
Sys.setenv(SPOTIFY_CLIENT_SECRET = secret)
# Set the access token.
access_token <- get_spotify_access_token()
# Create a function to retrieve an artist's spotify id from api.
get_spotify_id <- function(artist_name) {
# To avoid timeout issues, I create a retry mechanism system here, which contains a loop combined with tryCatch and Sys.sleep to make repeated attempts at the API call.
attempt <- 1 # Set the minimum number of retry attempts.
max_attempts <- 5 # Set the maximum number of retry attempts.
# Write a loop to retrieve data while avoiding timeout errors.
while (attempt <= max_attempts) {
try({
# Set up the api endpoint.
search_url <- paste0('https://api.spotify.com/v1/search?q=', URLencode(artist_name), '&type=artist&limit=1')
# Get response.
response <- GET(search_url, timeout(30), add_headers(Authorization = paste('Bearer', access_token)))
if (status_code(response) == 200) {
search_results <- content(response, as = "parsed", type = "application/json")
# Here I add an if sentence to deal with potential NA values.
if (length(search_results$artists$items) > 0) {
return(search_results$artists$items[[1]]$id)
} else {
return(NA)
}
}
}, silent = TRUE)
attempt <- attempt + 1
Sys.sleep(5) # pause for 5 seconds before retrying
}
return(NA)
}
# Improve the function into another function that retrieve a list of artists' spotify id.
get_ids <- function(rank) {
for (i in 1:nrow(rank)) {
name <- rank$Name[i]
rank$id[i] <- get_spotify_id(name)
}
return(rank)
}
# Get all the artists' spotify ids and add them into my dataframe.
greatest_100_artists <- get_ids(greatest_100_artists)
# Retrieve followers and popularity index.====================================================
# Function to get an artist's followers and popularity index.
get_artist_details <- function(id, token) {
# Set the url endpoint.
url <- paste0("https://api.spotify.com/v1/artists/?ids=", id)
# To avoid timeout issues mentioned before, here I create a retry mechanism as well.
attempt <- 1 # Set the minimum number of retry attempts.
max_attempts <- 5 # Set the maximum number of retry attempts.
while (attempt <= max_attempts) {
response <- tryCatch({
# Get API response.
GET(url, timeout(30), add_headers(`Authorization` = paste("Bearer", token)))
}, error = function(e) { NULL })
# Check if the request was successful
if (!is.null(response) && status_code(response) == 200) {
# Retrieve text details from a json file we got.
details <- fromJSON(content(response, "text", encoding = "UTF-8"))
# Extract the followers and popularity data.
followers <- details$artists$followers$total
popularity <- details$artists$popularity
# Create a new data frame to storage the retrieved data.
return(data.frame(followers = followers, popularity = popularity))
} else {
message(sprintf("Attempt %d failed. Retrying in %d seconds...", attempt, attempt))
Sys.sleep(attempt) # Exponential back-off
attempt <- attempt + 1
}
}
# If all attempts failed, output warning message and return NA values.
warning("All attempts failed.")
return(data.frame(followers = NA, popularity = NA))
}
# Create a function to get an artist's top track and its related details.
get_top_track <- function(artist_id) {
top_track <- data.frame()
tracks <- get_artist_top_tracks(artist_id)
top_track <- data.frame(
top_track_name = tracks$name[1],
top_track_id = tracks$id[1],
track_popularity = tracks$popularity[1],
album_id = tracks$album.id[1]
)
return(top_track)
}
# Mapping the artists dataframe to add followers and popularity details.
for (i in 1:nrow(greatest_100_artists)) {
id <- greatest_100_artists$id[i]
df <- get_artist_details(id, access_token)
greatest_100_artists$followers[i] <- df$followers[1]
greatest_100_artists$popularity[i] <- df$popularity[1]
}
# Mapping the artists dataframe to get their top tracks and popularity scores.
tracks <- data.frame()
for (i in 1:nrow(greatest_100_artists)) {
id <- greatest_100_artists$id[i]
if (length(get_top_track(id)) == 0) {
tracks <- rbind(tracks, rep(NA, ncol(tracks)))
}
else {tracks <- rbind(tracks, get_top_track(id))}
}
# Combine the data.
greatest_100_artists <- cbind(greatest_100_artists, tracks)
# Write this data into my local storage for future usage.
write.csv(greatest_100_artists, "artists_spotify_data", row.names = FALSE)
# Review the first few rows of the updated dataframe.
data <- read.csv("artists_spotify_data")
head(data)
## Visualisation ==========================================================
# Load "plotly" package for generating interactive plots.
library(plotly)
library(ggplot2)
#install.packages("highcharter")
library(highcharter)
# Convert the Name column to a factor and keep the original order.
data$Name <- factor(data$Name, levels = unique(data$Name))
# Use highcharter to create an interactive bar chart.
hchart(data, "bar", hcaes(x = Name, y = followers)) %>%
hc_title(text = "P1. The Spotify Followers of the 100 Greatest Artists") %>%
hc_xAxis(labels = list(style = list(fontSize = '8px')))
# Create a heat map to show the popularity of all artists.
data$hover_info2 <- paste("Name:", data$Name, "<br>Popularity:", data$popularity, "<br>Rank:", data$Rank) # Set the hover information.
data$hover_info3 <- paste("Name:", data$Name, "<br>Track Popularity:", data$track_popularity, "<br>Rank:", data$Rank) # Set the hover information.
library(tidyverse)
# Compare the tracks' popularity with the artists' popularity scores.
# Transform the data to a longer format.
data_long <- data %>%
pivot_longer(cols = c(popularity, track_popularity),
names_to = "popularity_type") %>%
na.omit()
# Draw a heat map comparing the popularity between the artists' and their top tracks.
hchart(data_long, "heatmap", hcaes(x = Name, y = popularity_type, value = value)) %>%
hc_xAxis(categories = unique(data_long$Name)) %>%
# Adjust the color scale.
hc_colorAxis(minColor = "white",
maxColor = "#DC143C") %>%
# Adjust the hover information.
hc_tooltip(formatter = JS("function () {
return '<b>' + this.series.xAxis.categories[this.point.x] +
'</b>, Value: <b>' +
this.point.value.toFixed(2) + '</b>';
}")
) %>%
# Customize axes and labels.
hc_xAxis(title = list(text = "Name")) %>%
hc_yAxis(title = list(text = "Popularity Type")) %>%
# Customize the title of the legend.
hc_legend(title = list(text = "Popularity Index")) %>%
# Add a title.
hc_title(text = "P2-1. Comparison of Popularity Scores Between Artists and Their Top Tracks")
# Create some popularity thresholds for analysis.
data_reverse <- data[order(-data$popularity),] # Reorder the artists data.
data_reverse$Name <- factor(data_reverse$Name, levels = data_reverse$Name) # Convert the Name column into a factor.
divider_position <- which(data_reverse$popularity == 50)[1] # Find the first threshold where popularity score equals to 50.
divider_position2 <- which(data_reverse$popularity == 75)[2] # Find another threshold where popularity score equals to 75.
# Create another two popularity threshold points.
data_reverse2 <- data[order(-data$track_popularity),] # Reorder the artists data.
data_reverse2$Name <- factor(data_reverse2$Name, levels = data_reverse2$Name) # Convert the Name column into a factor.
track_divider_position <- which(data_reverse2$track_popularity == 50)[1] # Find the first threshold where track popularity score equals to 50.
track_divider_position2 <- which(data_reverse2$track_popularity == 75)[5] # Find another threshold where track popularity score equals to 75.
# Create another heat map to show the popularity of all artists after sorting in reverse order of the popularity index.
# This is done for comparing the differences between the rankings and popularity of all artists.
heatmap_compare <- ggplot(data, aes(x = reorder(Name, -popularity), y = 1, fill = popularity, text = hover_info2)) +
geom_tile() + # Create a tile plot.
# Add vertical lines to the graph as thresholds.
geom_vline(xintercept = divider_position, color = "darkblue", linetype = "dashed", linewidth = 0.3) +
geom_vline(xintercept = divider_position2, color = "darkblue", linetype = "dashed", linewidth = 0.3) +
labs(title = "P2-2. The Spotify Popularity Score of the 100 Greates Artists", x = "Name", y = "", fill = "Popularity") + # Set the labs.
# Add annotations.
annotate("text", label = "Popularity: 50", x = divider_position, y = 1, angle = 90, vjust = -0.5, color = "darkblue", size = 3) +
annotate("text", label = "Popularity: 75", x = divider_position2, y = 1, angle = 90, vjust = -0.5, color = "darkblue", size = 3) +
annotate("text", label = "26%", x = divider_position2/2, y = 1.25, color = "darkgreen") +
annotate("text", label = "64%", x = (divider_position+divider_position2)/2, y = 1.25, color = 'darkgreen') +
annotate("text", label = "10%", x = (divider_position+100)/2, y = 1.25, color = 'darkgreen') +
theme_minimal() + # Set the theme.
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, size = 6), # Set the text of the x axis.
axis.text.y = element_blank(), # Hide the text of the y axis.
axis.ticks.y = element_blank(), # Hide the scale on the y axis of the chart.
axis.title.y = element_blank(), # Hide the title of the y axis.
axis.title.x = element_blank(), # Hide the title of the x axis.
panel.grid = element_blank()) + # Hide the grid.
scale_fill_gradientn(colors = c("white", "lightyellow", "red")) # Specify a gradient fill.
ggplotly(heatmap_compare, tooltip = "text") # Display the interactive plot.
# Create another heatmap, indicating the track popularity.
heatmap_compare2 <- ggplot(data, aes(x = reorder(Name, -track_popularity), y = 1, fill = track_popularity, text = hover_info3)) +
geom_tile() + # Create a tile plot.
# Add vertical lines to the graph as thresholds.
geom_vline(xintercept = track_divider_position, color = "darkblue", linetype = "dashed", linewidth = 0.3) +
geom_vline(xintercept = track_divider_position2, color = "darkblue", linetype = "dashed", linewidth = 0.3) +
labs(title = "P2-3. The Spotify Popularity Score of the 100 Greates Artists' Top Tracks", x = "Name", y = "", fill = "Popularity") + # Set the labs.
# Add annotations.
annotate("text", label = "Popularity: 50", x = divider_position, y = 1, angle = 90, vjust = -0.5, color = "darkblue", size = 3) +
annotate("text", label = "Popularity: 75", x = track_divider_position2, y = 1, angle = 90, vjust = -0.5, color = "darkblue", size = 3) +
annotate("text", label = "54%", x = track_divider_position2/2, y = 1.25, color = "darkgreen") +
annotate("text", label = "41%", x = (track_divider_position+track_divider_position2)/2, y = 1.25, color = 'darkgreen') +
annotate("text", label = "5%", x = (track_divider_position+100)/2, y = 1.25, color = 'darkgreen') +
theme_minimal() + # Set the theme.
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, size = 6), # Set the text of the x axis.
axis.text.y = element_blank(), # Hide the text of the y axis.
axis.ticks.y = element_blank(), # Hide the scale on the y axis of the chart.
axis.title.y = element_blank(), # Hide the title of the y axis.
axis.title.x = element_blank(), # Hide the title of the x axis.
panel.grid = element_blank()) + # Hide the grid.
scale_fill_gradientn(colors = c("white", "lightyellow", "red")) # Specify a gradient fill.
ggplotly(heatmap_compare2, tooltip = "text") # Display the interactive plot.
# Load the packages.
#library(httr)
#library(jsonlite)
# Set up my YouTube data API key.
readRenviron("E:/文件/LSE/MY472/youtube api.env")
youtube_key <- Sys.getenv("KEY")
# Create a function to fetch YouTube channel ID
search_youtube_channel <- function(artist_name, api_key) {
base_url <- "https://www.googleapis.com/youtube/v3/search"
query <- list(part = "snippet", q = artist_name, type = "channel", key = api_key)
response <- GET(url = base_url, query = query)
data <- content(response, "parsed")
# Handle the NA values.
if (length(data$items) > 0) {
return(data$items[[1]]$snippet$channelId)
} else {
return(NA)
}
}
# Get the channel ids by searching artists' names.
channel_ids <- lapply(data$Name, search_youtube_channel, youtube_key)
ids <- unlist(channel_ids)
yt_channel <- data.frame(names = data$Name, ids = ids)
# Non-return sampling.
# Generate a sampling index (start at 1 and draw every 4).
sampling_indices <- seq(1, nrow(yt_channel), by=5)
# Use the index to select from the sample.
selected_samples <- yt_channel[sampling_indices, ]
# Store the sampled table as a local file. Because the YouTube API has a ration for tokens everyday.
write.csv(selected_samples, "selected_samples", row.names = FALSE)
sampled_channel <- read.csv("selected_samples")
# Write a function to get video statistics.
get_video_statistics <- function(video_id, api_key) {
stats_base_url <- "https://www.googleapis.com/youtube/v3/videos"
stats_params <- list(
part = "statistics",
id = video_id,
key = api_key
)
stats_response <- GET(url = stats_base_url, query = stats_params)
stats_content <- fromJSON(rawToChar(stats_response$content), flatten = TRUE)
# Handle potential NA values.
if (length(stats_content$items) == 0) {
return(list(
ViewCount = NA,
CommentCount = NA,
LikeCount = NA,
FavoriteCount = NA
))
}
# Ensure that each statistic is available, else NA. And collect the view counts, comment counts, like counts and favourite counts.
view_count <- ifelse(!is.null(stats_content$items$statistics.viewCount), stats_content$items$statistics.viewCount, NA)
comment_count <- ifelse(!is.null(stats_content$items$statistics.commentCount), stats_content$items$statistics.commentCount, NA)
like_count <- ifelse(!is.null(stats_content$items$statistics.likeCount), stats_content$items$statistics.likeCount, NA)
fav_count <- ifelse(!is.null(stats_content$items$statistics.favoriteCount), stats_content$items$statistics.favoriteCount, NA)
# Structure the retrieved data as a data frame.
video_stats <- data.frame(Like_Count = like_count,
View_Count = view_count,
Comment_Count = comment_count,
Fav_Count = fav_count)
return(video_stats)
}
# Write a function to get the latest videos and their statistics.
get_latest_videos <- function(channel_id, api_key) {
base_url <- "https://www.googleapis.com/youtube/v3/search"
# Set the parameters.
params <- list(
part = "snippet",
channelId = channel_id,
maxResults = 1,
order = "date",
type = "video",
key = api_key
)
response <- GET(url = base_url, query = params)
content <- fromJSON(rawToChar(response$content), flatten = TRUE)
# Handle potential errors.
if (!"items" %in% names(content)) {
stop("No items found in API response.")
}
# Get the video's id, title, and statistics.
video_id <- content$items$id.videoId
video_title <- content$items$snippet.title
statistics <- get_video_statistics(video_id[1], api_key)
# Create a new data frame as final output.
video_details <- data.frame(
Name = sampled_channel$names[which(sampled_channel$ids == channel_id)],
VideoTitle = video_title,
VideoID = video_id,
ViewCount = statistics$View_Count,
CommentCount = statistics$Comment_Count,
LikeCount = statistics$Like_Count,
FavoriteCount = statistics$Fav_Count,
stringsAsFactors = FALSE
)
return(do.call(cbind, video_details))
}
# Create a blank data frame for storage.
video_data2 <- data.frame()
# Fetch the latest video for each artist
for (i in 1:nrow(sampled_channel)) {
channel_id <- sampled_channel$ids[i]
videos2 <- get_latest_videos(channel_id, youtube_key)
# Process the videos data.
video_data2 <- rbind(video_data2, videos2)
}
# Create a function to get the first ten comments of each video.
get_comments <- function(video_id, api_key) {
url <- paste0("https://www.googleapis.com/youtube/v3/commentThreads?key=", api_key,
"&textFormat=plainText&part=snippet&videoId=", video_id,
"&maxResults=20") # Request for the top 20 comments.
response <- GET(url)
content <- fromJSON(rawToChar(response$content))
# Extract the comments.
comments <- content$items$snippet$topLevelComment$snippet$textDisplay
combined_comments <- paste(comments, collapse = "\n")
return(combined_comments)
}
# Create a list for storage.
Comments <- list()
# Get the top few comments for each video.
for (i in 1:nrow(video_data2)) {
comments <- get_comments(video_data2$VideoID[i], youtube_key)
Comments[i] <- comments
}
video_data2$Comments <- Comments
# Write the data to local directions.
save(video_data2, file = "youtube_sample_data.RData")
# View the final data.
load("youtube_sample_data.RData")
head(video_data2)
library(tidyverse)
# Transform the data to longer format.
transformed_data <- video_data2 %>%
pivot_longer(cols = c(LikeCount, CommentCount),
names_to = "Statistics")
# Transform the value column into numeric.
transformed_data$value <- as.numeric(transformed_data$value)
# Draw a bar chart of likes and comment counts.
hchart(transformed_data, "bar", hcaes(x = Name, y = value, group = Statistics)) %>%
hc_title(text = "P3. Statistics of the Latest Videos of Sample Artists")
# Transform the view count column into numeric.
video_data2$ViewCount <- as.numeric(video_data2$ViewCount)
# Draw a bar chart of view counts.
hchart(video_data2, "bar", hcaes(x = Name, y = ViewCount)) %>%
hc_title(text = "P3-2. Sample Artist's Latest Video Views")
#install.packages("tm")
#install.packages("wordcloud")
library(tm) # Use the "tm" (Text Mining) package to process the text.
library(wordcloud) # Use this package to generate a wordcloud.
# Prepare the Text Data
comments_text <- paste(video_data2$Comments, collapse=" ")
# Create a Text Corpus and Clean the Data
corpus <- Corpus(VectorSource(comments_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Create a word cloud of the comments.
wordcloud(corpus, scale=c(4,0.6), max.words=80, random.order=FALSE, rot.per=0, colors=brewer.pal(8, "Set1"))
title(main = "P4. Word Cloud of YouTube Comments")
# this chunk generates the complete code appendix.
# eval=FALSE tells R not to run (``evaluate'') the code here (it was already run before).